DataFrame Joins

We will discuss about following join types in this post:
  • CROSS JOIN
  • INNER JOIN
  • LEFT OUTER JOIN
  • RIGHT OUTER JOIN
  • FULL OUTER JOIN
  • LEFT SEMI JOIN
  • ANTI LEFT JOIN
Joining data between DataFrames is one of the most common multi-DataFrame transformations. The standard SQL join types are all supported and can be specified as the joinType in df.join(otherDf, sqlCondition, joinType) when performing a join. Spark’s supported join types are “inner,” “left_outer” (aliased as “outer”), “left_anti,” “right_outer,” “full_outer,” and “left_semi.” With the exception of “left_semi” these join types all join the two tables, but they behave differently when handling rows that do not have keys in both tables.

val df1 = Seq( (101, "sachin",40),
                        (102, "zahir",41),
                        (103, "virat",29),
                        (104, "saurav",41),
                        (105,"rohit",30)).toDF("id", "name","age")

val df2 = Seq((101, "batsman"),
                       (102, "bowler"),
                       (103, "batsman"),
                       (104, "batsman")).toDF("id", "skill")

val df3 = Seq((101, "sachin",100),
                      (103, "virat",50),
                      (104,  "saurav",45),
                      (105,"rohit",35)).toDF("id", "name","centuries")

Cross Join: This join is very expensive to perform as it creates (m*n) combination of rows , where m is number of rows in df1 and n is number of rows in df2.
val result_df=df1.crossJoin(df2)
Inner Join::Inner joins are used to fetch common data between 2 tables, in this case two dataframes. You can join 2 dataframes on the basis of some key column/s and get the required data into another output dataframe. The “inner” join is both the default.

join() operation will join two dataframes based on some common column which in the  column id from df1 and df2.
val dataframe=df1.join(df2, "id").show
val dataframe=df1.join(df2, df1("id") === df2("id"), "inner")

But, what if the column to join to had different names? In such a case, you can explicitly specify the column from each dataframe on which to join.

val  employee=emp.join(dept.,$ "dept.dept_id" === $"emp.edept_id").select($ "emp.employee_name".as("ename"),$"sal",$"address")

Left Join:Left outer join returns all the rows from table/dataframe on the left side and matching records from the right side dataframe. If the record does not exists on right side dataframe then in output you will see NULL as the values for non matching records.

val dataframe=df1.join(df2, df1("id") === df2("id"), "left_outer")
dataframe.show(5)
val dataframe=df1.join(df3, Seq("name"),"left")
dataframe.show(5)
Right Join:Right outer join is similar to LEFT outer join with only difference is that it brings all the records from dataframe on the right side and only matching records from dataframe on the left side. If the value is not matching then it will return NULL for left table column. You can use “rightouter” or just “right” in the jointype.
val dataframe=df1.join(df2, df1("id") === df2("id"), "right_outer")
dataframe.show(5)


val dataframe=df1.join(df3, Seq("name"),"right")
dataframe.show(5)


Full outer Join:Full Outer Join returns matching records from both the dataframes as well as non-matching records. Column values are set as NULL for non matching records in respective rows. You can use “outer”, “full” or “fullouter” as join type in the below query. All three means the same and will give same result.

val dataframe=df1.join(df2, df1("id") === df2("id"), "Full_outer")
dataframe.show(5)

val dataframe= df1.join(df3, Seq("name"),"full_outer")
dataframe.show(5)

Left Semi join: When the left semi join is used, all rows from the left dataset having their correspondence in the right dataset are returned in the final result. However, unlike left outer join, the result doesn't contain merged data from both datasets. Instead, it contains only the information (columns) brought by the left dataset:
 
val dataframe= df1.join(df2, df1("id") === df2("id"), "left_semi")
dataframe.show(5)

The last described type is left anti join. It takes all rows from the left dataset that don't have their matching in the right dataset.
Left anti join: This join is  exact opposite of the leftsemi,leftanti join returns only columns from the left dataset for non-matched records.

val dataframe= df1.join(df3, df1("name") === df3("name"), "left_anti")
dataframe.show(5)


No comments:

Post a Comment